The aim of this project….
This project used data from 2930 residential property sales in Ames, Iowa between 2006 and 2012. There are 82 explanatory variables in the data set, comprising nominal, ordinal, discrete, and continuous attributes. Continuous variables describe the area dimensions of the house and property, such as the size of the lot and the garage, among others. Discrete variables, on the other hand, quantify characteristics of the house and property, like the number of kitchens, baths, bedrooms, and parking spots. Nominal variables generally describe types of materials and locations, such as the name of the neighborhood or the type of foundation. Ordinal variables typically rate the condition and quality of various house characteristics and utilities.
Prior to doing the exploratory data analysis, we hypothesize that the following variables will be the most predictive of home price: lot area, home type, year built, and overall quality. We think these will be the most predictive because we assume that if we were to be in the market for a home, these would be among the top criteria we would consider when deciding which home to purchase.
Furthermore, we hypothesize that a generalized additive model (GAM) will be the best model to use, because a GAM can combine the strengths of several other model types, including polynomials, cubic splines, and smoothing splines.
Since our goal is to predict sale price, we first looked at the distribution of sale price in our data set.
What we observe from Figure 1 is that the distribution of sale price is right-skewed: a few houses in the data set have relatively high prices. This is a limitation that we discuss further in the limitations section of our discussion. We then proceeded to analyze trends in sale price. More specifically, we explored how sale price varied across the years in which houses were sold.
Figure 2 allows us to see that the relationship between year sold and sale price is linear. Overall, it seems that there is an upward trend in sale price since the 1940s. We can also observe outliers across time. The next step in our exploratory data analysis was to explore the variables we hypothesized would be strong predictors of price. We began by exploring the variables themselves and then explored the relationship between these variables and sale price.
When it comes to lot area, this dataset has many outliers, as shown above. We found 127 outliers, the smallest of which has a lot area of 17755 square feet. As these made visualization difficult, we temporarily removed them. After removing the outliers, we can see that lot area has a roughly normal distribution centered near the median of 9436.5 square feet.
From Figure 3, we see that 1-story homes built in 1946 or later make up the bulk of our dataset: 1079 homes. This is over one-third of our total dataset, which has 2930 observations. Please note that the graphs are interactive, so move your cursor over a graph to see more details.
Furthermore, we can observe from Figure 4 that most homes were built within five years of 2005.
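The report does not say how outliers were flagged. A common choice, and the one boxplots use, is Tukey's rule: any value above Q3 + 1.5 × IQR counts as an upper outlier. The Python sketch below illustrates that rule under this assumption; it is not the code used for the analysis, and the function names `upper_fence` and `split_outliers` are hypothetical.

```python
from statistics import quantiles

def upper_fence(values, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR (assumed rule, not confirmed by the report)."""
    # method="inclusive" matches the default quantile definition in R (type 7).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def split_outliers(values, k=1.5):
    """Return (kept, outliers), where outliers lie strictly above the upper fence."""
    fence = upper_fence(values, k)
    kept = [v for v in values if v <= fence]
    outliers = [v for v in values if v > fence]
    return kept, outliers

# Toy lot areas, for illustration only.
kept, outliers = split_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Under this rule, the "minimum outlier value" is simply the smallest element of `outliers`.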
Exploring kitchen quality, from the table below, we can observe that the mean kitchen quality in this data is 3.51.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.511 4.000 5.000
We can observe from Figure 5 that there is large variation in sale price across different neighborhoods. Even within a neighborhood we see variation. Investigating some housing characteristics may give us insight into the variation observed in price within neighborhoods.
We first examined overall quality (Figure 6) and, as expected, price increases as overall quality increases. Examining year built (Figure 7), we observe that the newer a home is, the higher its price, on average.
In addition to investigating the relationship between sale price and location, overall quality, and age of the house, we also examined the relationship between sale price and home type. We find that 2-story homes built in 1946 or later have the highest median home prices (Figure 8).
Figure 9 explores the relationship between kitchen quality and sale price. The higher the kitchen quality, the higher the median sale price. This increase, however, is not linear but roughly quadratic. From Figure 10, we can see that, as expected, there is a gradual positive relationship between lot area and sale price.
Missing data:
We opted to remove any observations with missing values from the final data set used for variable selection and modeling.
Modifying variable class:
We decided to keep the selected quality variables continuous rather than converting them to factors. Converting them to factors would have led us to drop the “Very Poor” (“1”) level, as this level has only around 4 observations. By keeping the variables continuous, we retain these observations and can better predict the prices of homes that fall under this category.
Model Selection:
We began our model selection by reducing the number of variables within our housing data set. We created a subset of the data that included the variables we hypothesized would be important predictors of sale price.
These variables include:
- LotArea: Lot size in square feet
- OverallQual: Rates the overall material and finish of the house
- YearBuilt: Original construction date
- Exterior1st: Exterior covering on house
- HeatingQC: Heating quality and condition
- Foundation: Type of foundation
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- KitchenQual: Kitchen quality
- BsmtFinType1: Rating of basement finished area
- Neighborhood: Physical locations within Ames city limits
- LandSlope: Slope of property
- Street: Type of road access to property
- HouseStyle: Style of dwelling
- GarageQual: Garage quality
- Fence: Fence quality
- YrSold: Year Sold (YYYY)

We further included additional variables that will be used later in the report to create a renovation calculator:

- FullBath: Full bathrooms above grade
- RoofStyle: Type of roof

Using our subset, we ran (1) subset selection, (2) forward stepwise selection, and (3) backward stepwise selection for our variable selection. The graphs below plot the number of variables against the BIC value for each of the three methods.
Across all variable selection methods, a model with 7 variables has the lowest BIC score. Comparing the seven-variable models across the three selection methods, we see that they share most of their variables: tot_rms_abv_grd, overall_qual, lot_area, Kitchen.Qual, and NeighborhoodNorthridge appear in all three.
| x |
|---|
| (Intercept) |
| tot_rms_abv_grd |
| overall_qual |
| lot_area |
| Bsmt.Qual |
| Kitchen.Qual |
| NeighborhoodNorthridge |
| BsmtFin.Type.1Unf |
| x |
|---|
| (Intercept) |
| tot_rms_abv_grd |
| overall_qual |
| lot_area |
| Kitchen.Qual |
| NeighborhoodNorthridge |
| NeighborhoodNorthridge Heights |
| BsmtFin.Type.1Unf |
| x |
|---|
| (Intercept) |
| tot_rms_abv_grd |
| overall_qual |
| lot_area |
| Kitchen.Qual |
| NeighborhoodNorthridge |
| NeighborhoodNorthridge Heights |
| BsmtFin.Type.1GLQ |
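
The BIC comparison behind these tables can be illustrated with a small Python sketch. It assumes the Gaussian least-squares form of BIC, n·ln(RSS/n) + k·ln(n), which may differ by an additive constant from whatever the selection routine reported. The function names and the example numbers are hypothetical and are not taken from the analysis.

```python
import math

def gaussian_bic(rss, n_obs, n_params):
    """BIC for a least-squares fit, up to an additive constant (assumed form)."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)

def best_by_bic(candidates, n_obs):
    """candidates maps a label to (residual sum of squares, number of parameters)."""
    scores = {
        label: gaussian_bic(rss, n_obs, k)
        for label, (rss, k) in candidates.items()
    }
    # The preferred model is the one with the lowest BIC.
    return min(scores, key=scores.get), scores

# Made-up values: the larger model wins only if it reduces RSS enough
# to offset the ln(n) penalty paid for each extra parameter.
best, scores = best_by_bic({"3 vars": (400.0, 3), "7 vars": (200.0, 7)}, n_obs=100)
```

Because the penalty grows with ln(n), BIC tends to favor smaller models than AIC does on data sets of this size.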
Following our variable selection analysis, we used those variables to fit a GAM and a linear model to predict sale price.
We began by computing 10-fold CV error estimates for polynomial regression, cubic spline, and smoothing spline models. The graphs below show the results of the cross-validation, allowing us to determine the model and degrees of freedom that best fit the relationship between each of our selected numerical variables and sale price.
A smoothing spline with 2 degrees of freedom appears to be the best model choice for lot area. It has the lowest CV error and the most stable curve.
A smoothing spline with 6 degrees of freedom appears to be the best fit for the total rooms above grade variable. While a cubic spline with fewer degrees of freedom is comparable, the cubic spline becomes more unstable at higher degrees of freedom.
A smoothing spline with 6 degrees of freedom appears to be a good fit for overall quality; however, other models perform comparably well.
A quadratic polynomial appears to be the best fit for kitchen quality, as it has the lowest error.
A cubic spline with 8 degrees of freedom appears to be the best model for year built. Other models are close in CV error and are fairly stable, but the cubic spline has the lowest error.
For full baths above grade, the plot suggests that the model with the lowest CV error is a smoothing spline with 4 degrees of freedom.
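The 10-fold CV procedure used in these comparisons can be sketched in Python as follows. To keep the sketch self-contained, it compares a mean-only model with a simple least-squares line rather than the polynomial and spline fits used in the report. The function names are hypothetical and this is not the analysis code.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    y_bar = mean(ys)
    return lambda x: y_bar

def fit_line(xs, ys):
    """Simple least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def kfold_cv_mse(xs, ys, fit, k=10, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            mean([(ys[i] - predict(xs[i])) ** 2 for i in held_out])
        )
    return mean(fold_errors)
```

Each candidate model is scored by calling `kfold_cv_mse` with the same folds (same `seed`), and the candidate with the lowest score is preferred. The same logic applies when the candidates are splines with different degrees of freedom.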
| model | RMSE | MAE |
|---|---|---|
| linear | 33512.93 | 23495.54 |
| gam | 31157.1 | 21034.14 |
Our hypothesis on model selection was correct. Examining RMSE and MAE for both the linear and GAM models, we observe that the GAM outperforms the linear model on both metrics.
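For reference, the two metrics in the table are defined as in the Python sketch below. It is an illustration of the definitions, not the code that produced the table, and the function names are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the table. A wide gap between the two suggests that a few large errors dominate.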
##
## Call: gam(formula = saleprice ~ s(lot_area, 2) + s(tot_rms_abv_grd,
## 6) + s(overall_qual, 6) + poly(Kitchen.Qual, 2) + bs(year_built,
## 8) + s(full_bath_abv_grd, 4) + Neighborhood + full_bath_abv_grd +
## Roof.Style + BsmtFin.Type.1, data = training)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -319638 -14814 -908 13116 211142
##
## (Dispersion Parameter for gaussian family taken to be 987751789)
##
## Null Deviance: 15347390784124 on 2391 degrees of freedom
## Residual Deviance: 2296522858852 on 2325 degrees of freedom
## AIC: 56396.82
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value
## s(lot_area, 2) 1 875284022212 875284022212 886.1376
## s(tot_rms_abv_grd, 6) 1 3113363822414 3113363822414 3151.9698
## s(overall_qual, 6) 1 6394247223301 6394247223301 6473.5365
## poly(Kitchen.Qual, 2) 2 324693372011 162346686005 164.3598
## bs(year_built, 8) 8 301174084064 37646760508 38.1136
## s(full_bath_abv_grd, 4) 1 26183048366 26183048366 26.5077
## Neighborhood 27 464638877046 17208847298 17.4222
## Roof.Style 5 18799865768 3759973154 3.8066
## BsmtFin.Type.1 6 137947181826 22991196971 23.2763
## Residuals 2325 2296522858852 987751789
## Pr(>F)
## s(lot_area, 2) < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6) < 0.00000000000000022 ***
## s(overall_qual, 6) < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2) < 0.00000000000000022 ***
## bs(year_built, 8) < 0.00000000000000022 ***
## s(full_bath_abv_grd, 4) 0.0000002845 ***
## Neighborhood < 0.00000000000000022 ***
## Roof.Style 0.001949 **
## BsmtFin.Type.1 < 0.00000000000000022 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(lot_area, 2) 1 81.966 < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6) 5 23.867 < 0.00000000000000022 ***
## s(overall_qual, 6) 5 50.247 < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2)
## bs(year_built, 8)
## s(full_bath_abv_grd, 4) 3 28.924 < 0.00000000000000022 ***
## Neighborhood
## full_bath_abv_grd
## Roof.Style
## BsmtFin.Type.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the summary output for the GAM, all of our variables are statistically significant at the p = 0.01 level, which suggests that these variables are relevant predictors of saleprice. This is in line with part of our hypothesis, namely that lot_area and overall_qual would be statistically significant predictors of saleprice; year_built is significant as well. Contrary to our hypothesis, home type was not retained during variable selection and does not appear in the final model.
Our goal was to create a baseline representing the worst version of the most common house. We constructed this approximation by taking the lowest number of full baths, the lowest kitchen quality, an unfinished basement, and the mode of all the other variables in our dataset.
To determine the cost of improvements in full baths above grade, kitchen, roof, and basement, we created a base sale price for comparison. The baseline house behind this base sale price has the following characteristics:
Based on our renovation calculator:
If you change your roof type from the most common roof type to any other roof type, on average, your house value will go down by $11,124.38.
If you go from an unfinished basement to any type of finished basement, our model predicts that, on average, your house value will go down by $1,084.64.
If you make any upgrade to the kitchen from the lowest-quality kitchen, on average, your house value will go up by $34,580.24.
If you add any number of full bathrooms above grade (when you started with zero), on average, your house value will go up by $49,657.56.
We then used our estimated costs from our renovation calculator to predict a new sale price for 2010 houses. We created four subset data frames: houses with the lowest kitchen quality, an unfinished basement, a gable roof style, and zero full bathrooms.
The predicted sales prices are displayed below.
lq.kitchen.df
## saleprice New_Saleprice
## 1 107500 142080.2
lq.bathroom.df
## saleprice New_Saleprice
## 1 144000 193657.6
## 2 260000 309657.6
head(lq.basement.df)
## saleprice New_Saleprice
## 1 189000 187915.36
## 2 175900 174815.36
## 3 180400 179315.36
## 4 88000 86915.36
## 5 120000 118915.36
## 6 376162 375077.36
head(lq.roof.df)
## saleprice New_Saleprice
## 1 105000 93875.62
## 2 189900 178775.62
## 3 195500 184375.62
## 4 213500 202375.62
## 5 191500 180375.62
## 6 236500 225375.62
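The adjusted prices above are consistent with adding one fixed average effect per renovation to each observed sale price. The Python sketch below reproduces that arithmetic using the effects reported by the renovation calculator. The `EFFECTS` table and the `new_sale_price` function are hypothetical names, and the report's own implementation may differ.

```python
# Average effects reported by the renovation calculator, in dollars.
# Negative values mean the predicted sale price goes down.
EFFECTS = {
    "roof": -11124.3795492,
    "basement": -1084.6420984,
    "kitchen": 34580.2351596,
    "bathroom": 49657.563354,
}

def new_sale_price(sale_price, upgrade):
    """Shift an observed sale price by the average effect of one upgrade."""
    return sale_price + EFFECTS[upgrade]

# First row of each table above, rounded for display.
kitchen_example = round(new_sale_price(107500, "kitchen"), 1)
roof_example = round(new_sale_price(105000, "roof"), 2)
```

Because the shift is the same for every house, this approach ignores interactions, such as a kitchen upgrade being worth more in some neighborhoods than in others.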
Particular Home Improvement Recommendations
If you have the lowest-quality kitchen, you should renovate it and upgrade to a kitchen of any better quality.
If you have zero full bathrooms above grade, you should renovate and add at least one to get an increase of $49,657.56 on average.
If you have a gable roof, you should not renovate to change the roof type.